[test] WiFi periodic scanning #2088

Closed
wants to merge 4 commits into from

mcspr
Collaborator

@mcspr mcspr commented Dec 29, 2019

fix #2064
based on xoseperez/justwifi@master...mcspr:better-networks

Sort of works. A low interval of 10 sec with two APs seems to work, jumping from one to the other.
One strange issue: when an AP disappears (by unplugging it), the connection log shows the old network but the device actually connects to the other one. Not sure if that is some feature of the SDK or I have a typo somewhere.

@mcspr
Collaborator Author

mcspr commented Dec 29, 2019

@xoseperez Unless you had some other ideas about scanning approach... JustWifi is also refactored quite a bit, but mostly in regards to internal flow.

@davebuk
Contributor

davebuk commented Dec 12, 2020

Are there any plans to get this merged or does it require more testing? I had a situation the other day after a power cut. The devices connected to the first AP they found with the same SSID but not the strongest. I had to do a manual re-connect per device so it then connected to the strongest SSID.

@mcspr
Collaborator Author

mcspr commented Dec 13, 2020

Not in the way it is written here. I'll need to update the logic a bit.

As mentioned in the original topic, I did add RPN ops as a sort-of workaround:

30000u every_ms -60 rssi ge end disconnect

i.e. every 30 seconds, check the current RSSI and disconnect when it's less than -60. But it does not take into account the surrounding networks (or whether there are any at all).

@davebuk
Contributor

davebuk commented Dec 15, 2020

I'll add RPN to some builds at some point and see if the problem devices are more stable.

I know I can send a reboot command via MQTT, but is there a reconnect command? I know the device has to have a working WiFi connection to receive it, but I can monitor the RSSI level in openHab and maybe then send a reconnect command if too low.

@mcspr
Collaborator Author

mcspr commented Dec 15, 2020

> I know I can send a reboot command via MQTT, but is there a reconnect command? I know the device has to have a working WiFi connection to receive it, but I can monitor the RSSI level in openHab and maybe then send a reconnect command if too low.

There is one via TERMINAL_WEB_API_SUPPORT=1, you can just send it terminal commands. Just beware of the current bug - you need a 'value=value_can_be_empty' key present as well as the 'line=command'

$ curl -v -XPUT -H "Api-Key: something" http://1.2.3.4/api/cmd --data "line=wifi.reset&value="

I'll need to update the basic /api/rpc with the same action

edit: however, it will leak memory when doing the wifi reset. hm...

@mcspr
Collaborator Author

mcspr commented Dec 15, 2020

:oops: I misread the MQTT part as HTTP. There is TERMINAL_MQTT_SUPPORT=1 which adds {root}/cmd/set. Similarly, it accepts a single line on input as payload and runs it.

@davebuk
Contributor

davebuk commented Jan 1, 2021

> Not in a way it is written here. I'll need to update the logic a bit.
>
> As mentioned in the original topic, I did add RPN ops as sort-of workaround:
>
> 30000u every_ms -60 rssi le disconnect
>
> i.e. every 30 seconds check current RSSI and disconnect when it's less than -60. But, it does not take into an account the surrounding networks (or if there are any at all)

I have tried a device using the above rule but the wifi disconnects every 30 seconds regardless of RSSI level. The device currently has a value of around -56. Does the rule need some additional code or an 'IF' statement?

@mcspr
Collaborator Author

mcspr commented Jan 1, 2021

Yes, the missing piece is `end` between the condition and `disconnect`. `le` will put either true or false on the stack, which I thought `disconnect` checks for... but it does not.

There is no real `if` atm:
mcspr/rpnlib#1
If there was, `rssi -60 le { disconnect } if` would be possible instead of using `end`.

@davebuk
Contributor

davebuk commented Jan 1, 2021

With rssi around -56, the following still disconnects which I think is wrong.
30000u every_ms -60 rssi le end disconnect

The following stops the disconnect.
30000u every_ms -60 rssi ge end disconnect

I believe this should equate to "if -60 >= rssi (-56)", in which case the stack equals 0, so the rule ends. That doesn't make sense?

Am I understanding the RPN implementation correctly?

@davebuk
Contributor

davebuk commented Jan 1, 2021

Restarting the wifi near that device I believe triggers the rule, but the device seemed to do a restart with the following exception:

[550602] [MAIN] Last reset reason: Exception
[550603] [MAIN] Last reset info: Fatal exception:9 flag:2 (Exception) epc1:0x4024e4c8 epc2:0x00000000 epc3:0x00000000 excvaddr:0x3fff8d29 depc:0x00000000

@mcspr
Collaborator Author

mcspr commented Jan 1, 2021

Yes. For a stack of `A B CONDITION`, the comparison is evaluated as `A CONDITION B`:
https://github.com/mcspr/rpnlib/blob/a4e5cba8f89aba05fcde291333c6019e2e6b292e/src/rpnlib_operators.cpp#L256
(I should mention it in the README there. I did write the `if` example correctly, but did not edit the message above it correctly. Fixed :)
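To make the operand order concrete, here is a minimal C++ sketch of how `-60 rssi ge end disconnect` would evaluate. It is not rpnlib's actual code: `rpn_compare` and `should_disconnect` are hypothetical names, and the assumption that `end` stops the rule when the top of the stack is false is taken from the behaviour reported in this thread.

```cpp
#include <functional>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for an RPN comparison operator (not rpnlib's code).
// For a stack of `A B CONDITION`, B is popped first, then A,
// and the result is `A CONDITION B`.
bool rpn_compare(std::vector<double>& stack,
                 const std::function<bool(double, double)>& op) {
    if (stack.size() < 2) {
        throw std::runtime_error("comparison needs two operands");
    }
    const double b = stack.back();  // top of the stack: B
    stack.pop_back();
    const double a = stack.back();  // value below it: A
    stack.pop_back();
    return op(a, b);
}

// Models `-60 rssi ge end disconnect`.
// Assumption: `end` stops the rule when the comparison result is false,
// so `disconnect` only runs when this returns true.
bool should_disconnect(double rssi, double threshold = -60.0) {
    std::vector<double> stack{threshold, rssi};  // pushed in script order
    return rpn_compare(stack, std::greater_equal<double>());
}
```

With an RSSI of -56 the comparison is `-60 >= -56`, which is false, so the rule stops before `disconnect`. With an RSSI of -70 it is true and the rule continues.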

No exception here. Can you check how it decodes the crash dump?
https://github.com/xoseperez/espurna/blame/dev/.github/ISSUE_TEMPLATE/bug_report.md#L41-L59

@davebuk
Contributor

davebuk commented Jan 2, 2021

Could you point me towards what I need to do :)

crash in the webui console doesn't show anything.

@mcspr
Collaborator Author

mcspr commented Jan 2, 2021

crash force? (it marks crash dump as 'read' when called once)

@davebuk
Contributor

davebuk commented Jan 2, 2021

Yep

Reason of restart: 2

Exception (9):
epc1=0x4024e4c8 epc2=0x00000000 epc3=0x00000000 excvaddr=0x3fff8d29 depc=0x00000000

>>>stack>>>

ctx: todo
sp: 3ffffd10 end: 3fffffc0 offset: 0000
3ffffd10:  00000020 00000014 00000000 40100fc9 
3ffffd14:  00000001 ffffffff 3ffffda0 00000000 
3ffffd18:  00000000 00000010 00000020 0000000b 
3ffffd1c:  3fff31f7 00000020 3ffffe10 4023b09a 
3ffffd20:  3fff49a4 00000010 3ffffe10 3fff7974 
3ffffd24:  00000002 3fff0c5c 00000002 402231dd 
3ffffd28:  3ffffda0 00000000 00000014 3ffffe70 
3ffffd2c:  3fff7594 3fff7594 00000020 00000015 
3ffffd30:  00000005 00000001 3fff0c5c 4022ff2b 
3ffffd34:  00001e33 00bd000b 3ffffd15 4023b011 
3ffffd38:  000000bd 00000001 3ffffe70 00000001 
3ffffd3c:  3fff6804 00000020 3fff0c14 4023b760 
<<<stack<<<

@mcspr
Collaborator Author

mcspr commented Jan 2, 2021

The decoding script should tell what those addresses mean. Provided you saved the build directory, there should still be an .elf file alongside the .bin. The addresses are not constant and depend on the build config.

@davebuk
Contributor

davebuk commented Jan 2, 2021

firmware.zip
Zipped .elf file attached.

Can this be viewed in a standard text editor or does it require specific software?

I can open it in Notepad++ but can't find any of the addresses in the file listed above.

@mcspr
Collaborator Author

mcspr commented Jan 2, 2021

See the bug_report.md link above; it explains the decoder. issue2088.txt contains the copy-pasted text of the exception, starting with Exception (9):... and ending with <<<stack<<<

> git clone https://github.com/mcspr/EspArduinoExceptionDecoder
> py -3 EspArduinoExceptionDecoder\decoder.py -t ~\.platformio\packages\toolchain-xtensa\ -e firmware.elf .\issue2088.txt
Exception: 9 (LoadStoreAlignmentCause: Load or store to an unaligned address)

epc1:     0x4024e4c8: tcp_write at /local/users/gauchard/arduino/arduino_esp8266/esp8266-lwip/tools/sdk/lwip2/builder/lwip2-src/src/core/tcp_out.c:715
epc2:     0x00000000
epc3:     0x00000000
excvaddr: 0x3fff8d29
depc:     0x00000000

stack:

ctx: todo
sp:       0x3ffffd10
end:      0x3fffffc0
offset:   0x00000000

0x40100fc9: realloc at C:\users\server\.platformio\packages\framework-arduinoespressif8266\cores\esp8266\umm_malloc/umm_malloc.cpp:586
0x4023b09a: String::changeBuffer(unsigned int) at C:\users\server\.platformio\packages\framework-arduinoespressif8266\cores\esp8266/WString.cpp:182 (discriminator 3)
0x3fff0c5c: ?? at C:\ESP Software\Firmware Builds\Espurna\Dev\espurna-dev\code/espurna/mqtt.cpp:42
0x402231dd: AsyncClient::add(char const*, unsigned int, unsigned char) at C:\ESP Software\Firmware Builds\Espurna\Dev\espurna-dev\code/libraries\ESPAsyncTCP\src/ESPAsyncTCP.cpp:256
0x3fff0c5c: ?? at C:\ESP Software\Firmware Builds\Espurna\Dev\espurna-dev\code/espurna/mqtt.cpp:42
0x4022ff2b: AsyncMqttClient::publish(char const*, unsigned char, bool, char const*, unsigned int) at C:\ESP Software\Firmware Builds\Espurna\Dev\espurna-dev\code/libraries\AsyncMqttClient\src/AsyncMqttClient.cpp:834
0x4023b011: String::invalidate() at C:\users\server\.platformio\packages\framework-arduinoespressif8266\cores\esp8266/WString.cpp:140
0x3fff0c14: ?? at C:\ESP Software\Firmware Builds\Espurna\Dev\espurna-dev\code/espurna/mqtt.cpp:86
0x4023b760: String::concat(String const&) at C:\users\server\.platformio\packages\framework-arduinoespressif8266\cores\esp8266/WString.cpp:320

Can you also try adding yield after disconnect? This seems to have something to do with the mqtt publishing stuff when disconnected, but mqtt.cpp lines do not really make any sense

@davebuk
Contributor

davebuk commented Jan 2, 2021

I can see the details now for the decoder when looking on a normal screen, I could only see the left hand side of the page on my phone.

yield seemed to work.

[264554] [MAIN] Boot version: 31
[264554] [MAIN] Boot mode: 1
[264555] [MAIN] Last reset reason: Reboot from web interface

@mcspr
Collaborator Author

mcspr commented Jan 2, 2021

Can you also share the rules list? I'd guess there's an `mqtt_send` somewhere that gets caught by surprise.
`disconnect` will do `yield` automatically in a future patch, but for now it is required.

@davebuk
Contributor

davebuk commented Jan 2, 2021

There aren't any other rules defined on this device.

@davebuk
Contributor

davebuk commented Jan 3, 2021

FYI, the WiFi AP was offline for at least 30 minutes this morning. The device running the rule had gone into ESPURNA AP mode.

A power cycle made the device reconnect back to the previously known AP.

If I'd left it longer, would the device try and reconnect to the known AP or, would it stay in the ESPURNA AP mode until rebooted?

@mcspr
Collaborator Author

mcspr commented Jan 3, 2021

It should retry after ~3 minutes:

#define WIFI_RECONNECT_INTERVAL 180000 // If could not connect to WIFI, retry after this time in ms

@mcspr
Collaborator Author

mcspr commented Mar 21, 2021

edit: closing as out-of-date

So I've been looking at wifi stuff yet again, and I think I want to get rid of justwifi part altogether (or at least until I can figure out what would be a nicer library API). Obviously, it should also be better than the RPN script, since it would have access to the BSSID info as well and won't reconnect to the same poor AP

The question remains, though: what exactly should the reconnect policy be?

  • find a better current network, i.e. only search for the exact same SSID that is currently used
  • maybe make an RPN network scanner returning how many networks are available for the SSID
  • find a better network across all configs (ssid0...ssid4, WIFI1_SSID...WIFI5_SSID)

And the RSSI should probably be an average, not just a plain comparison.
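As a rough illustration of the averaging idea, the sketch below keeps the last N readings in a small ring buffer and reports their mean. The `RssiAverage` class and its method names are made up for this example and do not come from ESPurna or JustWifi.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: mean of the last N RSSI readings (dBm).
template <std::size_t N>
class RssiAverage {
public:
    // Store a new reading, overwriting the oldest one once the buffer is full.
    void add(int8_t rssi) {
        samples_[next_] = rssi;
        next_ = (next_ + 1) % N;
        if (count_ < N) {
            ++count_;
        }
    }

    // True once N readings have been collected.
    bool full() const {
        return count_ == N;
    }

    // Integer mean of the stored readings; 0 when nothing was recorded yet.
    int average() const {
        if (count_ == 0) {
            return 0;
        }
        int sum = 0;
        for (std::size_t i = 0; i < count_; ++i) {
            sum += samples_[i];
        }
        return sum / static_cast<int>(count_);
    }

private:
    std::array<int8_t, N> samples_{};
    std::size_t next_ = 0;
    std::size_t count_ = 0;
};
```

A caller could compare `average()` against the threshold only once `full()` is true, so a single bad reading does not trigger a reconnect.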

@mcspr mcspr closed this Mar 21, 2021
@mcspr
Collaborator Author

mcspr commented Mar 21, 2021

Or, some parameter for the connection where it won't allow an RSSI less than a certain value
(which also, in theory, works with or without scan mode enabled; it is an ESP SDK parameter which we could change)

@mcspr mcspr deleted the wifi/periodic-scan branch March 21, 2021 12:10
@davebuk
Contributor

davebuk commented Mar 21, 2021

A quick test of two of my devices. They have both been connected without any issues for over 30 days. One's RSSI was -80 and the other's -81. After clicking the Reconnect option in the webUI, both changed to a better, stronger channel. One, though, only went to -70, while the other went to -46.

Obviously, depending on the location of a device WRT the known/configured APs, an arbitrary figure to force reconnect may not be the best option.

I'd have thought the device could scan periodically (every 10 or 30 mins, say), get the RSSI values for the configured APs and compare them. If there is a stronger signal, run the reconnect code.
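A sketch of that comparison step, under stated assumptions: `ScanResult` and `find_stronger` are hypothetical names, the scan results are assumed to carry SSID, BSSID and RSSI, and nothing here reflects how the firmware actually stores its scan data.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical shape of one scan entry.
struct ScanResult {
    std::string ssid;
    std::string bssid;
    int8_t rssi;  // dBm, closer to 0 is stronger
};

// Returns the strongest configured AP that beats the current connection,
// or nullptr when the scan found nothing better.
const ScanResult* find_stronger(const std::vector<ScanResult>& scan,
                                const std::vector<std::string>& configured,
                                const std::string& current_bssid,
                                int8_t current_rssi) {
    const ScanResult* best = nullptr;
    for (const auto& entry : scan) {
        if (entry.bssid == current_bssid) {
            continue;  // skip the AP we are already connected to
        }
        const bool known = std::find(configured.begin(), configured.end(),
                                     entry.ssid) != configured.end();
        if (!known) {
            continue;  // only consider SSIDs that are configured on the device
        }
        if (entry.rssi <= current_rssi) {
            continue;  // not an improvement
        }
        if (!best || entry.rssi > best->rssi) {
            best = &entry;
        }
    }
    return best;
}
```

The reconnect code would only run when the function returns a non-null entry. This says nothing about whether the stronger AP can actually reach the router.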

I'm not sure what happens if the stronger AP doesn't have access to the router? Say if the WiFi of the AP is working but can't communicate for some reason?

@mcspr
Collaborator Author

mcspr commented Mar 21, 2021

Re. just above, I meant: why solve the reconnection problem when we can just ignore bad APs and wait for a good one 🤔
Unlike something like a laptop, the device does not move most of the time, and it could just do the normal connection loop (maybe slightly shortened, though).

comms are a whole other question here. we do check if there's an IP, but none of the network services are actually required to work for the connection to be considered ok

@davebuk
Contributor

davebuk commented Mar 22, 2021

Like you talked about above, I'm not sure what RSSI value should be considered poor. My device that went from -80 to -70 is communicating fine, although the webUI can become unresponsive at times. Maybe with a trigger of < -60 it should try to reconnect. Could there be a problem if the device, due to distance, is always around -70? Wouldn't it constantly try reconnecting even though it's never going to get a stronger signal?

Re: comms. For ease of management I run my 30ish devices using static IPs. That way I know which IP to use to get to a specific device. My router doesn't have enough capacity for that many static IP associations.

@davebuk
Contributor

davebuk commented Mar 22, 2021

I don't know what thresholds a mobile phone would use to auto scan for stronger configured WiFi networks?

@mcspr
Collaborator Author

mcspr commented Mar 23, 2021

There's a very lengthy description of what Android does - https://source.android.com/devices/tech/connect/wifi-network-selection?hl=en
and an iOS one, specific to roaming rules - https://support.apple.com/en-asia/HT203068
A good RSSI number is > -63...-73

It would do the loop, true... Maybe it could give up in a cycle or two, and then run the scan check while connected (i.e. do what justwifi thing here did).

@mcspr
Collaborator Author

mcspr commented Mar 28, 2021

I scrapped the threshold idea for the initial connection itself, which seems like overkill for a semi-stable connection. The connection will be dropped if the device drops too much traffic, so another scan will happen that way at some point.
The WIP I have right now seems to work ok, but it may still have some soft-lock bugs... Will push something during the week right into dev.

Also about static IPs and comms. In theory, there could be a check for 'essential' services like MQTT or something else, and it could drop the connection when those are not available. But it feels like it would add a lot of saved state and idk if it is really that useful
(like, what is the device supposed to do when none of the available networks work? It could retry the connection loop, but in case of some outage, just waiting seems to have the exact same result in the end)

@mcspr
Collaborator Author

mcspr commented Mar 31, 2021

See dev

#ifndef WIFI_SCAN_NETWORKS
#define WIFI_SCAN_NETWORKS 1 // Perform a network scan before connecting and when RSSI threshold is reached
#endif
#ifndef WIFI_SCAN_RSSI_THRESHOLD
#define WIFI_SCAN_RSSI_THRESHOLD -73 // Consider current network for a reconnection cycle
// when it's RSSI value is below the specified threshold
#endif
#ifndef WIFI_SCAN_RSSI_CHECKS
#define WIFI_SCAN_RSSI_CHECKS 3 // Amount of RSSI threshold checks before starting a scan
#endif
#ifndef WIFI_SCAN_RSSI_CHECK_INTERVAL
#define WIFI_SCAN_RSSI_CHECK_INTERVAL 60000 // Time (ms) between RSSI checks
#endif

That is the gist of it. It will connect to the first available network. It will poll the RSSI value every ~1 minute; after the 3rd time it is bad, it will switch into scanning mode and try to find something better, ignoring the current network ofc. I have not really tested this, though, besides on a dev board. My main concerns are the possibility of soft-locking the device in some weird state and the newish approach of doing the connection really early on boot. If this works, ok :) If it does not, pls open an issue
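The check described above could look roughly like the following. This is only a sketch and not the code in dev: `RssiMonitor` is a hypothetical name, the constants mirror the default values quoted in the defines, and resetting the counter on a good reading is an assumption (the description does not say whether the bad checks must be consecutive).

```cpp
#include <cstdint>

// Values mirror the defaults quoted in the defines.
constexpr int8_t ScanRssiThreshold = -73;
constexpr unsigned ScanRssiChecks = 3;

// Hypothetical monitor: call check() once per RSSI check interval.
class RssiMonitor {
public:
    // Returns true when enough bad readings were seen and a scan should start.
    bool check(int8_t rssi) {
        if (rssi >= ScanRssiThreshold) {
            bad_checks_ = 0;  // assumption: a good reading resets the counter
            return false;
        }
        if (++bad_checks_ < ScanRssiChecks) {
            return false;
        }
        bad_checks_ = 0;  // start counting again after triggering a scan
        return true;
    }

private:
    unsigned bad_checks_ = 0;
};
```

With the default 60000 ms interval, three bad readings in a row would trigger a scan after roughly three minutes.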

@davebuk
Contributor

davebuk commented Apr 15, 2021

Initial testing with two devices and it seems to work well.

Device was connected to one channel around -56. WiFi AP turned off near that device and it connected to a different AP on another channel at -76. WIFI AP turned back on and after approx. 3 mins, the device had connected back to the original AP.

Can the WIFI_SCAN_RSSI_THRESHOLD be adjusted in the webUI? Either using keys or within the WiFi section? It might be useful if the threshold could be adjusted per device rather than having to re-build per device.

@mcspr
Collaborator Author

mcspr commented Apr 19, 2021

Threshold is exposed as a setting, but not (yet) displayed in the UI:

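// build-time default (wifi::build, per the call below)
constexpr int8_t scanRssiThreshold() {
    return WIFI_SCAN_RSSI_THRESHOLD;
}

// runtime value: the "wifiScanRssi" setting, falling back to the build-time default
int8_t scanRssiThreshold() {
    return getSetting("wifiScanRssi", wifi::build::scanRssiThreshold());
}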

(I will try to squeeze it into another patch related to settings. Right now the idea is to show it when calling wifi, to see all of the things under the wifi 'namespace'.)
The amount of retries is not exposed, though; it is a build-time setting, at least for now.
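For readers unfamiliar with the pattern, here is a standalone sketch of "runtime setting with a build-time fallback". Only the `wifiScanRssi` key and the -73 default come from this thread; the `Settings` map and `scan_rssi_threshold` function are hypothetical stand-ins and do not model the firmware's real `getSetting` or its storage.

```cpp
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>

// Stands in for WIFI_SCAN_RSSI_THRESHOLD.
constexpr int8_t BuildScanRssiThreshold = -73;

// Hypothetical key-value store; the firmware's real settings storage
// is not modelled here.
using Settings = std::map<std::string, std::string>;

// Returns the stored "wifiScanRssi" value, or the build-time default when
// the key is missing, not a number, or out of the int8_t range.
int8_t scan_rssi_threshold(const Settings& settings) {
    const auto it = settings.find("wifiScanRssi");
    if (it == settings.end()) {
        return BuildScanRssiThreshold;
    }
    try {
        const int value = std::stoi(it->second);
        if (value < -128 || value > 127) {
            return BuildScanRssiThreshold;
        }
        return static_cast<int8_t>(value);
    } catch (const std::exception&) {
        return BuildScanRssiThreshold;
    }
}
```

This is what makes a per-device threshold possible without a rebuild: the device uses the stored key when present and the compiled-in value otherwise.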

Does it behave normally when neither AP is working and it needs to wait longer to reconnect? Does the fallback still work?

@davebuk
Contributor

davebuk commented Apr 20, 2021

What length of time should each check take? I isolated the WiFi and after approx 4 minutes I'd lost connection. Re-instated the WiFi but the device didn't connect for at least 2 mins and the espurna AP didn't come up. A power cycle re-boot reconnected back to the known WiFi as normal.

@mcspr
Collaborator Author

mcspr commented Apr 20, 2021

When disconnected from the AP, retries start in about 10-15 seconds. It attempts to connect 3 times (around ~4 seconds in between), then attempts again in 2 minutes. For a single network config, the fallback AP should be there in around 20 seconds.
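The retry timing can be summarised with a small sketch. The constants are the approximate figures from the paragraph above, while `next_attempt_delay` is a hypothetical helper, not a function from the firmware.

```cpp
#include <cstdint>

// Approximate figures from the description above.
constexpr uint32_t RetryDelayMs = 4000;     // ~4 s between attempts in a burst
constexpr uint32_t LongWaitMs = 120000;     // 2 minutes before the next burst
constexpr unsigned AttemptsPerBurst = 3;

// Hypothetical helper: delay before the next connection attempt,
// given how many attempts have failed so far.
uint32_t next_attempt_delay(unsigned failed_attempts) {
    if (failed_attempts == 0) {
        return 0;  // nothing failed yet, try right away
    }
    // Every third failure ends a burst and starts the long wait.
    return (failed_attempts % AttemptsPerBurst == 0) ? LongWaitMs : RetryDelayMs;
}
```

Under these numbers, a device that cannot connect at all would be making three quick attempts roughly every two minutes.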

There's a small bug in disconnection detection, though: when attempting the connection, the timeout happens too early and it schedules another connection while already connected... The device will needlessly disconnect->connect another 2 times, but will work ok after that happens. Not being able to connect brings up the fallback AP; not yet sure about that one.

Successfully merging this pull request may close these issues.

Periodically scan WiFi networks find the best one
2 participants